Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery. [https://machinelearningmastery.com/]
SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Crop Mapping in Canada dataset presents a multi-class classification problem in which we try to predict one of several (more than two) possible outcomes.
INTRODUCTION: This data set contains fused bi-temporal optical-radar data for cropland classification. The organization collected the images using RapidEye satellites (optical) and the Unmanned Aerial Vehicle Synthetic Aperture Radar (UAVSAR) system (radar) over an agricultural region near Winnipeg, Manitoba, Canada in 2012. There are 2 × 49 radar features and 2 × 38 optical features for two dates: 05 and 14 July 2012. Seven crop type classes exist for this data set as follows: 1-Corn; 2-Peas; 3-Canola; 4-Soybeans; 5-Oats; 6-Wheat; and 7-Broadleaf.
In this Take1 iteration, we will construct and tune machine learning models for this dataset using the Scikit-Learn library. We will observe the best accuracy result that we can obtain using the tuned models with the training and test datasets.
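The feature counts quoted in the introduction can be checked with simple arithmetic. The sketch below reads the description as 49 radar and 38 optical features per acquisition date; the variable names are illustrative and do not appear in the script.

```python
# Feature counts as stated in the dataset description (per acquisition date).
radar_per_date = 49
optical_per_date = 38
n_dates = 2  # 05 and 14 July 2012

n_radar = n_dates * radar_per_date      # radar features across both dates
n_optical = n_dates * optical_per_date  # optical features across both dates
n_features = n_radar + n_optical        # attribute columns, excluding the label

print("radar:", n_radar, "optical:", n_optical, "total:", n_features)
```

Under that reading, the dataframe should hold 174 attribute columns plus one class label column.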
ANALYSIS: The baseline run of the machine learning algorithms achieved an average accuracy of 94.87%. Two algorithms (Extra Trees and Random Forest) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Extra Trees turned in the better overall result, with an accuracy of 99.71% on the training data. When configured with the optimized parameters, the Extra Trees model processed the testing dataset with an accuracy of 99.74%, slightly higher than the accuracy obtained from the training data.
CONCLUSION: For this iteration, the Extra Trees model achieved the best overall results using the training and test datasets. For this dataset, we should consider using the Extra Trees algorithm for further modeling.
Dataset Used: Crop Mapping in Canada Data Set
Dataset ML Model: Multi-Class classification with numerical attributes
Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Crop+mapping+using+fused+optical-radar+data+set
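Any predictive modeling machine learning project can generally be broken down into about six major tasks:
1. Prepare Environment
2. Summarize and Visualize Data
3. Pre-process Data
4. Train and Evaluate Models
5. Fine-tune and Improve Models
6. Finalize Model and Present Analysis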
# Install the necessary packages for Colab
# !pip install python-dotenv PyMySQL
# Retrieve the GPU information from Colab
# gpu_info = !nvidia-smi
# gpu_info = '\n'.join(gpu_info)
# if gpu_info.find('failed') >= 0:
# print('Select the Runtime → "Change runtime type" menu to enable a GPU accelerator, ')
# print('and then re-execute this cell.')
# else:
# print(gpu_info)
# Retrieve the memory configuration from Colab
# from psutil import virtual_memory
# ram_gb = virtual_memory().total / 1e9
# print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))
# if ram_gb < 20:
# print('To enable a high-RAM runtime, select the Runtime → "Change runtime type"')
# print('menu, and then select High-RAM in the Runtime shape dropdown. Then, ')
# print('re-execute this cell.')
# else:
# print('You are using a high-RAM runtime!')
# Retrieve the CPU information
ncpu = !nproc
print("The number of available CPUs is:", ncpu[0])
# Set the random seed number for reproducible results
seedNum = 888
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import sys
import math
import smtplib
import boto3
from datetime import datetime
from dotenv import load_dotenv
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
# from sklearn.pipeline import Pipeline
# from sklearn.feature_selection import RFE
# from imblearn.over_sampling import SMOTE
# from imblearn.combine import SMOTEENN
# from imblearn.combine import SMOTETomek
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
# from imblearn.ensemble import BalancedRandomForestClassifier
# from imblearn.ensemble import RUSBoostClassifier
# from imblearn.ensemble import BalancedBaggingClassifier
# from xgboost import XGBClassifier
# Begin the timer for the script processing
startTimeScript = datetime.now()
# Set up the number of CPU cores available for multi-thread processing
n_jobs = 2
# Set up the flag for status notifications (setting to True will send a status message at each step!)
notifyStatus = True
# Set up the parent directory location for loading the dotenv files
useColab = False
if useColab:
    # Mount Google Drive locally for storing files
    from google.colab import drive
    drive.mount('/content/gdrive')
    gdrivePrefix = '/content/gdrive/My Drive/Colab_Downloads/'
    env_path = '/content/gdrive/My Drive/Colab Notebooks/'
    dotenv_path = env_path + "python_script.env"
    load_dotenv(dotenv_path=dotenv_path)
# Set up the dotenv file for retrieving environment variables
useLocalPC = False
if useLocalPC:
    env_path = "/Users/david/PycharmProjects/"
    dotenv_path = env_path + "python_script.env"
    load_dotenv(dotenv_path=dotenv_path)
# Configure the plotting style
# (newer Matplotlib releases name this style 'seaborn-v0_8')
plt.style.use('seaborn')
# Set Pandas options
pd.set_option("display.max_rows", 500)
pd.set_option("display.width", 140)
# Set the flag for splitting the dataset
splitDataset = True
splitPercentage = 0.25
# Set the number of folds for cross validation
n_folds = 5
# Set various default modeling parameters
scoring = 'accuracy'
# Set up the email notification function
def status_notify(msg_text):
    access_key = os.environ.get('SNS_ACCESS_KEY')
    secret_key = os.environ.get('SNS_SECRET_KEY')
    aws_region = os.environ.get('SNS_AWS_REGION')
    topic_arn = "arn:aws:sns:us-east-1:072417399597:PythonMLScriptNotification"
    if (access_key is None) or (secret_key is None) or (aws_region is None):
        sys.exit("Incomplete notification setup info. Script Processing Aborted!!!")
    sns = boto3.client('sns', aws_access_key_id=access_key, aws_secret_access_key=secret_key, region_name=aws_region)
    response = sns.publish(TopicArn=topic_arn, Message=msg_text)
    if response['ResponseMetadata']['HTTPStatusCode'] != 200:
        print('Status notification not OK with HTTP status code:', response['ResponseMetadata']['HTTPStatusCode'])
if notifyStatus: status_notify("Task 1 - Prepare Environment has begun! " + datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
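Every status message in the script appends a timestamp using the same strftime pattern. As a minimal sketch, that pattern could be factored into one helper. The names `STAMP_FORMAT` and `stamped` are hypothetical and not part of the original script.

```python
from datetime import datetime

# Same strftime pattern the script uses for every status message.
STAMP_FORMAT = '%a %B %d, %Y %I:%M:%S %p'

def stamped(msg_text, when=None):
    """Return msg_text followed by a formatted timestamp (hypothetical helper)."""
    when = when or datetime.now()
    return msg_text + " " + when.strftime(STAMP_FORMAT)

# A fixed datetime keeps the example deterministic.
print(stamped("Task 1 - Prepare Environment has begun!", datetime(2012, 7, 5, 14, 30, 0)))
```

A call such as `status_notify(stamped("Task 2 - Summarize and Visualize Data has begun!"))` would then replace the repeated string concatenation.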
dataset_path = 'https://dainesanalytics.com/datasets/crop-mapping-winnipeg/WinnipegDataset.txt'
Xy_original = pd.read_csv(dataset_path)
# Take a peek at the dataframe after import
Xy_original.head(10)
Xy_original.info(verbose=True)
Xy_original.describe()
Xy_original.isnull().sum()
print('Total number of NaN in the dataframe: ', Xy_original.isnull().sum().sum())
# Standardize the class column to the name of targetVar if required
Xy_original = Xy_original.rename(columns={'label': 'targetVar'})
# Convert columns from one data type to another
Xy_original['f137'] = Xy_original['f137'].astype('float')
Xy_original['f138'] = Xy_original['f138'].astype('float')
Xy_original['f139'] = Xy_original['f139'].astype('float')
Xy_original['f140'] = Xy_original['f140'].astype('float')
Xy_original['f141'] = Xy_original['f141'].astype('float')
# Take a peek at the dataframe after cleaning
Xy_original.head(10)
Xy_original.info(verbose=True)
Xy_original.describe()
Xy_original.isnull().sum()
print('Total number of NaN in the dataframe: ', Xy_original.isnull().sum().sum())
# Use variable totCol to hold the number of columns in the dataframe
totCol = len(Xy_original.columns)
# Set up variable totAttr for the total number of attribute columns
totAttr = totCol-1
# targetCol variable indicates the column location of the target/class variable
# If the first column, set targetCol to 1. If the last column, set targetCol to totCol
# If (targetCol <> 1) and (targetCol <> totCol), be aware when slicing up the dataframes for visualization
targetCol = 1
# We create attribute-only and target-only datasets (X_original and y_original)
# for various visualization and cleaning/transformation operations
if targetCol == totCol:
    X_original = Xy_original.iloc[:,0:totAttr]
    y_original = Xy_original.iloc[:,totAttr]
else:
    X_original = Xy_original.iloc[:,1:totCol]
    y_original = Xy_original.iloc[:,0]
print("Xy_original.shape: {} X_original.shape: {} y_original.shape: {}".format(Xy_original.shape, X_original.shape, y_original.shape))
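The positional slicing above depends on the label sitting in the first or last column. As a sketch of an alternative, the split can be made by column name instead. The toy frame below is invented for illustration and assumes the label column is called `targetVar`, the name the script assigns when renaming.

```python
import pandas as pd

# Toy frame with the class label in the first column, as in this dataset.
toy = pd.DataFrame({
    "targetVar": [1, 2, 1],
    "f1": [0.1, 0.2, 0.3],
    "f2": [1.0, 2.0, 3.0],
})

# Select by name rather than by position, so the label may sit anywhere.
X_toy = toy.drop(columns=["targetVar"])
y_toy = toy["targetVar"]

print(X_toy.shape, y_toy.shape)
```

Selecting by name avoids the special-casing on `targetCol`, at the cost of requiring a consistent label column name.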
# Set up the number of row and columns for visualization display. dispRow * dispCol should be >= totAttr
dispCol = 4
if totAttr % dispCol == 0:
    dispRow = totAttr // dispCol
else:
    dispRow = (totAttr // dispCol) + 1
# Set figure width to display the data visualization plots
fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = dispCol*4
fig_size[1] = dispRow*4
plt.rcParams["figure.figsize"] = fig_size
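The row-count calculation above is a ceiling division. A compact equivalent, shown as a sketch, uses `math.ceil`. The function name `grid_rows` is hypothetical, and the value 174 assumes 2 × 49 radar plus 2 × 38 optical attributes.

```python
import math

def grid_rows(n_plots, n_cols):
    """Number of rows needed to lay out n_plots in n_cols columns."""
    return math.ceil(n_plots / n_cols)

# 174 attributes arranged in 4 columns.
print(grid_rows(174, 4))
```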
if notifyStatus: status_notify("Task 1 - Prepare Environment completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
if notifyStatus: status_notify("Task 2 - Summarize and Visualize Data has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
X_original.head(10)
X_original.info(verbose=True)
X_original.describe()
Xy_original.groupby('targetVar').size()
# Histograms for each attribute
X_original.hist(layout=(dispRow,dispCol))
plt.show()
# Box and Whisker plot for each attribute
X_original.plot(kind='box', subplots=True, layout=(dispRow,dispCol))
plt.show()
# # Correlation matrix
# fig = plt.figure(figsize=(16,12))
# ax = fig.add_subplot(111)
# correlations = X_original.corr(method='pearson')
# cax = ax.matshow(correlations, vmin=-1, vmax=1)
# fig.colorbar(cax)
# plt.show()
if notifyStatus: status_notify("Task 2 - Summarize and Visualize Data completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
if notifyStatus: status_notify("Task 3 - Pre-process Data has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
# Split the data further into training and test datasets
if splitDataset:
    X_train_df, X_test_df, y_train_df, y_test_df = train_test_split(X_original, y_original, test_size=splitPercentage,
                                                                    stratify=y_original, random_state=seedNum)
else:
    X_train_df, y_train_df = X_original, y_original
    X_test_df, y_test_df = X_original, y_original
print("X_train_df.shape: {} y_train_df.shape: {}".format(X_train_df.shape, y_train_df.shape))
print("X_test_df.shape: {} y_test_df.shape: {}".format(X_test_df.shape, y_test_df.shape))
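The `stratify` argument used in the split keeps class proportions similar across the training and test partitions. The sketch below illustrates this on synthetic, imbalanced labels; the data and the `*_demo` names are invented and unrelated to the crop dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (illustrative only, not the crop data).
rng = np.random.default_rng(888)
X_demo = rng.normal(size=(400, 5))
y_demo = np.repeat([1, 2, 3], [200, 120, 80])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=888)

# Class proportions in each partition should be close to the original.
for name, y_part in [("all", y_demo), ("train", y_tr), ("test", y_te)]:
    labels, counts = np.unique(y_part, return_counts=True)
    print(name, dict(zip(labels.tolist(), np.round(counts / counts.sum(), 3).tolist())))
```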
# Histograms for each attribute before pre-processing
columns_to_scale = X_original.columns[X_original.dtypes == 'float64'].tolist()
X_original[columns_to_scale].hist(layout=(dispRow,dispCol))
plt.show()
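# Apply feature scaling and transformation
# Note: the scaler is fit on X_original after the split. With splitDataset = True,
# X_train_df and X_test_df are separate copies and therefore remain unscaled
print('Columns to scale are:', columns_to_scale)
scaler = preprocessing.MinMaxScaler()
X_original[columns_to_scale] = scaler.fit_transform(X_original[columns_to_scale])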
print(X_original.head())
# Histograms for each attribute after pre-processing
X_original[columns_to_scale].hist(layout=(dispRow,dispCol))
plt.show()
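If scaled features were meant to feed the models, one common arrangement is to fit the scaler on the training partition only and reuse it for the test partition. The sketch below shows that pattern on synthetic data. It is an illustration under that assumption, not the procedure this script follows.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Synthetic data standing in for the training and test partitions.
rng = np.random.default_rng(888)
X_tr = rng.uniform(-5, 5, size=(100, 3))
X_te = rng.uniform(-5, 5, size=(30, 3))

scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn min/max from training rows only
X_te_scaled = scaler.transform(X_te)      # reuse the training min/max

print(X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
```

Test values can fall slightly outside [0, 1] under this scheme, because the test rows did not contribute to the fitted minimum and maximum.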
# Not applicable for this iteration of the project
# Finalize the training and testing datasets for the modeling activities
X_train = X_train_df.to_numpy()
y_train = y_train_df.to_numpy()
X_test = X_test_df.to_numpy()
y_test = y_test_df.to_numpy()
print("X_train.shape: {} y_train.shape: {}".format(X_train.shape, y_train.shape))
print("X_test.shape: {} y_test.shape: {}".format(X_test.shape, y_test.shape))
if notifyStatus: status_notify("Task 3 - Pre-process Data completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
if notifyStatus: status_notify("Task 4 - Train and Evaluate Models has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
# Set up Algorithms Spot-Checking Array
startTimeTraining = datetime.now()
startTimeModule = datetime.now()
train_models = []
train_results = []
train_model_names = []
train_metrics = []
train_models.append(('LDA', LinearDiscriminantAnalysis()))
train_models.append(('CART', DecisionTreeClassifier(random_state=seedNum)))
train_models.append(('KNN', KNeighborsClassifier(n_jobs=n_jobs)))
train_models.append(('BGT', BaggingClassifier(random_state=seedNum, n_jobs=n_jobs)))
train_models.append(('RNF', RandomForestClassifier(random_state=seedNum, n_jobs=n_jobs)))
train_models.append(('EXT', ExtraTreesClassifier(random_state=seedNum, n_jobs=n_jobs)))
# train_models.append(('XGB', XGBClassifier(random_state=seedNum, objective='multi:softmax', num_class=7, tree_method='gpu_hist')))
# Generate model in turn
for name, model in train_models:
    if notifyStatus: status_notify("Algorithm "+name+" modeling has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
    startTimeModule = datetime.now()
    kfold = KFold(n_splits=n_folds, shuffle=True, random_state=seedNum)
    cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring, n_jobs=n_jobs, verbose=1)
    train_results.append(cv_results)
    train_model_names.append(name)
    train_metrics.append(cv_results.mean())
    print("%s: %f (%f)" % (name, cv_results.mean(), cv_results.std()))
    print(model)
    print ('Model training time:', (datetime.now() - startTimeModule), '\n')
    if notifyStatus: status_notify("Algorithm "+name+" modeling completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
print ('Average metrics ('+scoring+') from all models:',np.mean(train_metrics))
print ('Total training time for all models:',(datetime.now() - startTimeTraining))
fig = plt.figure(figsize=(16,12))
fig.suptitle('Algorithm Comparison - Spot Checking')
ax = fig.add_subplot(111)
plt.boxplot(train_results)
ax.set_xticklabels(train_model_names)
plt.show()
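The spot-checking loop uses `KFold`, although `StratifiedKFold` is also imported. For a multi-class target, a stratified variant keeps the class mix similar in every fold. The sketch below shows how such a variant might look on a small synthetic problem; the data, fold count, and estimator settings are chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Small synthetic three-class problem (not the crop data).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, n_informative=5,
                                     n_classes=3, random_state=888)

# Stratified folds preserve class proportions within each fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=888)
model = ExtraTreesClassifier(n_estimators=20, random_state=888)
scores = cross_val_score(model, X_demo, y_demo, cv=skf, scoring='accuracy')

print("%f (%f)" % (scores.mean(), scores.std()))
```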
if notifyStatus: status_notify("Task 4 - Train and Evaluate Models completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
if notifyStatus: status_notify("Task 5 - Fine-tune and Improve Models has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
# Set up the comparison array
tune_results = []
tune_names = []
# Tuning algorithm #1 - Extra Trees
startTimeModule = datetime.now()
if notifyStatus: status_notify("Algorithm #1 tuning has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
tune_model1 = ExtraTreesClassifier(random_state=seedNum, n_jobs=n_jobs)
tune_names.append('EXT')
paramGrid1 = dict(n_estimators=np.array([10, 20, 50, 80, 100]))
kfold = KFold(n_splits=n_folds, shuffle=True, random_state=seedNum)
grid1 = GridSearchCV(estimator=tune_model1, param_grid=paramGrid1, scoring=scoring, cv=kfold, n_jobs=n_jobs, verbose=1)
grid_result1 = grid1.fit(X_train, y_train)
print("Best: %f using %s" % (grid_result1.best_score_, grid_result1.best_params_))
tune_results.append(grid_result1.cv_results_['mean_test_score'])
means = grid_result1.cv_results_['mean_test_score']
stds = grid_result1.cv_results_['std_test_score']
params = grid_result1.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
print ('Model training time:',(datetime.now() - startTimeModule))
if notifyStatus: status_notify("Algorithm #1 tuning completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
best_paramKey1 = list(grid_result1.best_params_.keys())[0]
best_paramValue1 = list(grid_result1.best_params_.values())[0]
print("Captured the best parameter for algorithm #1:", best_paramKey1, '=', best_paramValue1)
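The best parameter above is captured by taking the first key and value of `best_params_`, which works because the grid has a single parameter. Reading the value by name is a small alternative, sketched below on synthetic data with a reduced grid so that it runs quickly. The dataset and grid values are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Small synthetic three-class problem (not the crop data).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, n_informative=4,
                                     n_classes=3, random_state=888)

grid = GridSearchCV(
    estimator=ExtraTreesClassifier(random_state=888),
    param_grid={"n_estimators": [10, 20]},
    scoring="accuracy",
    cv=KFold(n_splits=3, shuffle=True, random_state=888),
)
grid.fit(X_demo, y_demo)

# Look the parameter up by name instead of relying on dictionary order.
best_n = grid.best_params_["n_estimators"]
print("best n_estimators:", best_n, "mean CV accuracy:", grid.best_score_)
```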
# Tuning algorithm #2 - Random Forest
startTimeModule = datetime.now()
if notifyStatus: status_notify("Algorithm #2 tuning has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
tune_model2 = RandomForestClassifier(random_state=seedNum, n_jobs=n_jobs)
tune_names.append('RNF')
paramGrid2 = dict(n_estimators=np.array([10, 20, 50, 80, 100]))
kfold = KFold(n_splits=n_folds, shuffle=True, random_state=seedNum)
grid2 = GridSearchCV(estimator=tune_model2, param_grid=paramGrid2, scoring=scoring, cv=kfold, n_jobs=n_jobs, verbose=1)
grid_result2 = grid2.fit(X_train, y_train)
print("Best: %f using %s" % (grid_result2.best_score_, grid_result2.best_params_))
tune_results.append(grid_result2.cv_results_['mean_test_score'])
means = grid_result2.cv_results_['mean_test_score']
stds = grid_result2.cv_results_['std_test_score']
params = grid_result2.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
print ('Model training time:',(datetime.now() - startTimeModule))
if notifyStatus: status_notify("Algorithm #2 tuning completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
best_paramKey2 = list(grid_result2.best_params_.keys())[0]
best_paramValue2 = list(grid_result2.best_params_.values())[0]
print("Captured the best parameter for algorithm #2:", best_paramKey2, '=', best_paramValue2)
fig = plt.figure(figsize=(16,12))
fig.suptitle('Algorithm Comparison - Post Tuning')
ax = fig.add_subplot(111)
plt.boxplot(tune_results)
ax.set_xticklabels(tune_names)
plt.show()
if notifyStatus: status_notify("Task 5 - Fine-tune and Improve Models completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
if notifyStatus: status_notify("Task 6 - Finalize Model and Present Analysis has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
test_model1 = ExtraTreesClassifier(n_estimators=best_paramValue1, random_state=seedNum, n_jobs=n_jobs)
test_model1.fit(X_train, y_train)
predictions1 = test_model1.predict(X_test)
print('Accuracy Score:', accuracy_score(y_test, predictions1))
print(confusion_matrix(y_test, predictions1))
print(classification_report(y_test, predictions1))
print(test_model1)
test_model2 = RandomForestClassifier(n_estimators=best_paramValue2, random_state=seedNum, n_jobs=n_jobs)
test_model2.fit(X_train, y_train)
predictions2 = test_model2.predict(X_test)
print('Accuracy Score:', accuracy_score(y_test, predictions2))
print(confusion_matrix(y_test, predictions2))
print(classification_report(y_test, predictions2))
print(test_model2)
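The confusion matrices printed above can also be reduced to per-class recall, which shows whether a high overall accuracy hides a weak class. The sketch below uses a small hand-made set of labels and predictions, not results from this project.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-made labels and predictions for three classes (illustrative only).
y_true = np.array([1, 1, 2, 2, 2, 3, 3, 3, 3, 3])
y_pred = np.array([1, 2, 2, 2, 2, 3, 3, 3, 1, 3])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3])

# Recall per class: correct predictions divided by the true count of that class.
per_class_recall = cm.diagonal() / cm.sum(axis=1)
overall = accuracy_score(y_true, y_pred)

print(cm)
print(per_class_recall, overall)
```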
# Combining the training and testing datasets to form the complete dataset that will be used for training the final model
# X_complete = np.vstack((X_train, X_test))
# y_complete = np.concatenate((y_train, y_test))
# print("X_complete.shape: {} y_complete.shape: {}".format(X_complete.shape, y_complete.shape))
# final_model = test_model1.fit(X_complete, y_complete)
# print(final_model)
# modelName = 'FinalModel_MultiClass.sav'
# dump(final_model, modelName)
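The commented-out lines above call `dump`, which the script never imports. Assuming it refers to `joblib.dump`, the sketch below saves and reloads a small model to show the round trip. The synthetic data, the file name `demo_model.joblib`, and the temporary directory are illustrative choices.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Small synthetic three-class problem (not the crop data).
X_demo, y_demo = make_classification(n_samples=120, n_features=6, n_informative=4,
                                     n_classes=3, random_state=888)
model = ExtraTreesClassifier(n_estimators=10, random_state=888).fit(X_demo, y_demo)

# Persist the fitted model to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.joblib")
    dump(model, model_path)
    restored = load(model_path)

same = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print("Predictions identical after reload:", same)
```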
if notifyStatus: status_notify("Task 6 - Finalize Model and Present Analysis completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
print ('Total time for the script:',(datetime.now() - startTimeScript))